Word Segmentation for Urdu OCR System

Authors

  • Misbah Akram
  • Sarmad Hussain
Abstract

This paper presents a technique for word segmentation in an Urdu Optical Character Recognition (OCR) system. Word segmentation, or word tokenization, is a preliminary task for understanding the meaning of sentences in Urdu language processing. Several word segmentation techniques are available for other languages, but little work has been done on word segmentation for Urdu OCR. The proposed methodology finds word boundaries in a sequence of ligatures using probabilistic formulas that draw on the collocation of ligatures and words in a corpus. The technique achieves a word identification rate of 96.10%, with a 65.63% identification rate for unknown words.

Keywords: Word Segmentation; Urdu OCR System; Urdu Ligature; Word language model; Ligature language model
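
The abstract does not give the probabilistic formulas, so the following is only a minimal sketch of the general idea (choosing word boundaries in a ligature sequence by maximizing a corpus-based probability), not the authors' method. It assumes an interpolated word-bigram model and a Viterbi-style dynamic program. The names `train`, `log_prob`, and `segment`, the parameters `lam`, `unk`, and `max_len`, and the Latin placeholder ligatures are all hypothetical.

```python
import math
from collections import Counter

def train(corpus):
    """Count word unigrams and bigrams.

    corpus: list of sentences; each sentence is a list of words;
    each word is a tuple of ligature strings (assumed representation).
    """
    uni, bi = Counter(), Counter()
    total = 0
    for sentence in corpus:
        prev = "<s>"          # sentence-start marker
        uni[prev] += 1        # kept only as a context count
        for word in sentence:
            uni[word] += 1
            bi[(prev, word)] += 1
            total += 1
            prev = word
    return uni, bi, total

def log_prob(prev, word, model, lam=0.7, unk=1e-4):
    """Interpolated bigram/unigram log-probability (an assumed smoothing scheme).

    Unknown words are penalized per ligature so that long unseen
    spans do not outscore sequences of known words.
    """
    uni, bi, total = model
    p_uni = uni[word] / total if word in uni else unk ** len(word)
    p_bi = bi[(prev, word)] / uni[prev] if uni[prev] else 0.0
    return math.log(lam * p_bi + (1 - lam) * p_uni)

def segment(ligatures, model, max_len=4):
    """Return the most probable split of a ligature list into words."""
    n = len(ligatures)
    if n == 0:
        return []
    # best[j][i] = (score, h): best path whose last word is ligatures[i:j],
    # where h is the start of the preceding word (None at sentence start).
    best = [dict() for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            word = tuple(ligatures[i:j])
            if i == 0:
                best[j][i] = (log_prob("<s>", word, model), None)
                continue
            best[j][i] = max(
                (score + log_prob(tuple(ligatures[h:i]), word, model), h)
                for h, (score, _) in best[i].items()
            )
    # Backtrack from the best-scoring final word.
    i, j = max(best[n], key=lambda s: best[n][s][0]), n
    words = []
    while i is not None:
        words.append(tuple(ligatures[i:j]))
        i, j = best[j][i][1], i
    return words[::-1]

# Toy corpus with placeholder ligatures standing in for Urdu ligatures.
toy_corpus = [
    [("a", "b"), ("c",)],
    [("a", "b"), ("d", "e")],
    [("c",), ("d", "e")],
]
toy_model = train(toy_corpus)
print(segment(["a", "b", "d", "e"], toy_model))
```

In this sketch an unseen ligature between known words is left as its own single-ligature word, which is one simple way to handle unknown words. The paper's own treatment, which also uses a ligature language model, may differ.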

Similar resources

Optical Character Recognition System for Urdu Words in Nastaliq Font

Optical Character Recognition (OCR) has been an attractive research area for the last three decades, and mature OCR systems reporting recognition rates close to 100% are available for many scripts and languages today. Despite these developments, research on text recognition in many languages is still in its early days, Urdu being one of them. The limited existing literature on Urdu OCR is either l...

Arabic & Urdu Text Segmentation Challenges & Techniques

Text segmentation is one of the critical steps in an OCR system for any language, because OCR accuracy depends on correctly segmented characters. Segmentation divides the text image into its constituent parts (i.e. lines, components or words, and individual characters). As Urdu and Arabic are highly cursive and context sensitive in nature and have improper space between words therefore,...

Segmentation of Nastaliq Script for OCR

In this paper we present a novel segmentation technique for implementing OCR (Optical Character Recognition) for printed Nastalique text, a calligraphic style of Urdu, which is written in the Arabic script. OCR systems for many of the world's major languages have been developed and are in use, but at present an OCR for Nastalique is not available and the published research on ...

A Word Segmentation System for Handling Space Omission Problem in Urdu Script

Word segmentation is the first obligatory task in almost all NLP applications, where the initial phase requires tokenizing the input into words. Like other Asian languages such as Chinese, Thai and Myanmar, Urdu also faces word segmentation challenges. Though the Urdu word segmentation problem is not as severe as in the other Asian languages, since space is used for word delimitation, but t...

Line and Ligature Segmentation in Printed Urdu Document Images

This paper presents a technique for segmentation of printed Urdu text images into lines and ligatures, a key pre-processing step in Urdu Optical Character Recognition (OCR) systems. Unlike classical projection profile based line segmentation methods, the proposed scheme successfully segments overlapping and touching lines. Once the lines are segmented, ligatures are extracted from each text lin...

Publication date: 2010